Abstract

Background: Standard numbering schemes for families of homologous proteins allow for the unambiguous 
identification of functionally and structurally relevant residues, to communicate results on mutations, and to 
systematically analyse sequence-function relationships in protein families. Standard numbering schemes have been 
successfully implemented for several protein families, including lactamases and antibodies, whereas a numbering 
scheme for the structural family of thiamine-diphosphate (ThDP)-dependent decarboxylases, a large subfamily of
the class of ThDP-dependent enzymes encompassing pyruvate-, benzoylformate-, 2-oxo acid-, indolepyruvate- and
phenylpyruvate decarboxylases, benzaldehyde lyase, acetohydroxyacid synthases and 2-succinyl-5-enolpyruvyl-6- 
hydroxy-3-cyclohexadiene-1-carboxylate synthase (MenD) is still missing. 

Despite a high structural similarity between the members of the ThDP-dependent decarboxylases, their sequences 
are diverse and make a pairwise sequence comparison of protein family members difficult. 

Results: We developed and validated a standard numbering scheme for the family of ThDP-dependent 
decarboxylases. A profile hidden Markov model (HMM) was created using a set of representative sequences from 
the family of ThDP-dependent decarboxylases. The pyruvate decarboxylase from S. cerevisiae (PDB: 2VK8) was
chosen as a reference because it is a well characterized enzyme. The crystal structure with the PDB identifier 2VK8
encompasses the structure of the ScPDC mutant E477Q, the cofactors ThDP and Mg2+ as well as the substrate
analogue (2S)-2-hydroxypropanoic acid. The absolute numbering of this reference sequence was transferred to all 
members of the ThDP-dependent decarboxylase protein family. Subsequently, the numbering scheme was 
integrated into the already established Thiamine-diphosphate dependent Enzyme Engineering Database (TEED) and 
was used to systematically analyze functionally and structurally relevant positions in the superfamily of 
ThDP-dependent decarboxylases. 

Conclusions: The numbering scheme serves as a tool for the reliable sequence alignment of ThDP-dependent 
decarboxylases and the unambiguous identification and communication of corresponding positions. Thus, it is the 
basis for the systematic and automated analysis of sequence-encoded properties such as structural and functional 
relevance of amino acid positions, because the analysis of conserved positions, the identification of correlated 
mutations and the determination of subfamily specific amino acid distributions depend on reliable multisequence 
alignments and the unambiguous identification of the alignment columns. The method is reliable and robust and 
can easily be adapted to further protein families. 
Background

Thiamine diphosphate (ThDP)-dependent decarboxylases
are a large subfamily of the class of ThDP-dependent
enzymes, which are essential in many biosynthetic
pathways. Due to the scientific and industrial relevance
of enzymes capable of catalysing C-C bond
formation and cleavage, we have focused in this work on
the decarboxylase superfamily of the ThDP-dependent 
Enzyme Engineering Database (TEED) [1]. This super- 
family contains among others pyruvate decarboxylases 
(PDCs, EC 4.1.1.1), indolepyruvate decarboxylases (IPDCs, 
EC 4.1.1.74), phenyl pyruvate oxidases (POXs, EC 1.2.3.3), 
the E1 component of pyruvate dehydrogenases (PDHs, EC
1.2.4.1), oxalyl-CoA decarboxylases (OCDCs, EC 4.1.1.8), 
benzaldehyde lyases (BALs, EC 4.1.2.38), benzoylformate 
decarboxylases (BFDs, EC 4.1.1.7), acetohydroxyacid 
synthases (AHASs, EC 2.2.1.6), glyoxylate carboligases 
(GXCs, EC 4.1.1.47), sulfoacetaldehyde acetyltransferases 
(SAATs, EC 2.3.3.15), 2-hydroxyphytanoyl-CoA lyases 
(2-HPCLs) and 2-succinyl-5-enolpyruvyl-6-hydroxy-3-cyclohexadiene-1-carboxylate
synthase (SEPHCHC, MenD).
Despite low sequence similarities between sequences of 
the decarboxylase superfamily of the TEED (~ 20%), 
their structures are highly similar. The structures con- 
sist of three domains, the N- and C-terminal domains 
are involved in binding of the cofactor ThDP and are 
named pyrimidine (PYR) and pyrophosphate (PP) bind- 
ing domain [2,3], respectively. They are separated by a 
third domain, which is less conserved and adopts differ- 
ent functions in the various enzyme families, e.g. by 
binding additional cofactors such as ADP [4] and FAD 
[5] or activators and inhibitors [6]. Due to structural 
relations between this middle domain and the transhydrogenase
domain dIII, this domain is called the TH3
domain [2,3]. 

Although all ThDP-dependent decarboxylases share 
the same fold and a similar mechanism utilising the co- 
factor ThDP, they catalyse a broad range of different 
reactions involving cleavage and formation of C-C bonds 
[7-9]. While the decarboxylation of 2-ketoacids [10] and 
the carboligation of two aldehydes to 2-hydroxy ketones 
are catalysed by most members of the ThDP-dependent 
decarboxylases [9], their substrate ranges are different. 
The well characterised PDC from Saccharomyces cerevi- 
siae, BFD from Pseudomonas putida and BAL from 
Pseudomonas fluorescens accept a broad variety of substrates
[7,11,12], while SEPHCHC-synthase (MenD) is
limited to a small number of substrates [13,14]. Add- 
itional complexity of C-C bond formation results from 
the fact that a substrate might be either a donor, which 
is activated by addition to ThDP in the active site, or an 
acceptor, which reacts with the ThDP-bound donor, 
resulting in different products [7,11,12]. Reactions catalysed
by members of the structural group of ThDP-dependent
decarboxylases include decarboxylation of 2-keto acids,
synthesis of various chiral 2-hydroxy ketones
by asymmetric benzoin- [11,15] and cross-benzoin con- 
densation [16,17], the racemic resolution of 2-hydroxy 
ketones via C-C bond cleavage [18], and Stetter-like 
reactions, e.g. the addition of decarboxylated 2-ketoglutarate
to isochorismate by MenD [19]. With the
exception of a few functionally relevant residues that 
have been identified by comparing sequences and struc- 
tures of homologous proteins or by mutation experi- 
ments, the molecular basis of this biochemical diversity 
is still unknown. Variants have been developed by ra- 
tional design and by directed evolution, in order to im- 
prove the activity of members of this enzyme family 
[16,20,21] or to alter substrate specificity [22-28] or 
stereoselectivity [29-31]. Some functionally relevant 
amino acids are located in the active site, mediating sub- 
strate binding [3], are involved in the activation of ThDP 
[28] or steer stereoselectivity [29-31], e.g. the S-pocket
as part of the acceptor binding site, which has been 
shown to contribute to the stereoselectivity of several 
members of the decarboxylase superfamily [29-31]. 
However, due to this complexity, combining results 
yielded from different variants of different protein fam- 
ilies, consolidating results on the function of specific 
residues and comparing results from different research 
groups is unfortunately not a straightforward process. 
An additional challenge in this respect is the identifica- 
tion of homologous positions in sequences of different 
proteins, in order to allow for their comparison. Amino 
acid exchanges in enzyme variants are usually identified 
by a number, signifying the absolute position of the 
amino acid in the respective protein in combination with 
the original and the newly introduced amino acid. This 
method only yields comparable results if the numbering 
is based on exactly the same sequence. In reality how- 
ever, published results often are based on slightly differ- 
ent protein sequences, often missing residues at the N- 
terminus or based on sequences derived from crystal 
structures. This makes the comparison of results con- 
cerning individual residues of one specific protein from 
different research groups or the comparison of results 
on homologous proteins manually intensive and pre- 
vents the use of automated tools for a large number of 
sequences. Therefore, an unambiguous numbering 
scheme for all members of the decarboxylase superfam- 
ily would be desirable. The usefulness of a generally 
accepted numbering scheme was demonstrated for the 
class A and B enzyme families of β-lactamases [32,33].
Based on structure-guided multisequence alignments of 
reference sequences [34], a number was assigned to each 
column of the alignment. Thus, each amino acid could 
be addressed unambiguously and consistently for all 
sequences. This numbering scheme is widely applied for 
the identification of key residues and for the naming of
variants [34]. The numbers assigned by this scheme 
might differ by more than 20 from the absolute amino 
acid numbering of a respective protein. Without a stand- 
ard numbering scheme, the systematic comparison of 
mutations would have to be done manually and would 
be error-prone. For the same reasons, a standard numbering
scheme was established for the complementarity-determining
regions (CDRs) of antibodies, thus allowing for a
systematic analysis and an unambiguous communication 
between research groups [35,36]. The numbering 
schemes were initially based on limited sets of protein 
sequences and were subsequently refined as more se- 
quence and structure data became available. In order to 
provide a standard numbering which is independent 
from the increasing sequence space, a numbering 
scheme based on one defined reference sequence would 
be desirable. Due to the low sequence similarity between 
ThDP-dependent decarboxylases from different homolo- 
gous families, it would not be reliable to transfer the ab- 
solute position numbers of the reference sequence to the 
residues of any decarboxylase sequence based on pair- 
wise alignments. To handle this challenge, we chose a 
structure-based and profile-guided approach for the 
transfer of position numbers. In this work, we present 
the establishment of a numbering scheme for the ThDP- 
dependent decarboxylases based on the sequence of the 
well-documented pyruvate decarboxylase from S. cerevisiae
(PDB: 2VK8 [6], Swissprot: P06169). The numbering
scheme was validated by comparing the multisequence
alignments it produces with those of the T-Coffee
alignment algorithm and by checking the structural
equivalence of positions with the same standard numbers.
Using this numbering scheme, the decarboxylase
superfamily was systematically analysed for conserved 
amino acids. 

Results 

Implementation and validation of a standard numbering 
scheme 

A standard numbering scheme for the decarboxylase 
superfamily of ThDP-dependent enzymes was estab- 
lished using the ThDP-dependent Enzyme Engineering 
Database (TEED). A profile hidden Markov model was 
created from a structure-guided multisequence align- 
ment of 16 representative proteins of the decarboxylase 
superfamily (Table 1). One of the representative pro- 
teins, the pyruvate decarboxylase from S. cerevisiae 
(ScPDC, Swissprot: P06169, PDB: 2VK8 [6]), was used as 
the reference sequence for numbering all proteins of the 
decarboxylase superfamily. In addition, 22 functionally 
and structurally relevant residues in the sequence of 
ScPDC were annotated as described in literature 
[2,5,28,30,31,37-39] (Additional file 1: Table S1). These
positions include the highly conserved active site residue
E51 (standard numbering) [40-43], the conserved
HH motif in PDCs (H114/H115) [28], the GDGX motif
443-446 and the Mg2+ binding site N471 [39], as well as
more variable regions such as the S-pocket residues P26,
G27, I476, and Q477 [29-31] and the start and end positions
of the three decarboxylase domains, the PYR, PP,
and TH3 domains [2].



Table 1 The set of 16 representative proteins used for establishing a standard numbering scheme

Protein                                                                     Organism         PDB identifier
pyruvate decarboxylase                                                      S. cerevisiae    2VK8
2-succinyl-5-enolpyruvyl-6-hydroxy-3-cyclohexadiene-1-carboxylate synthase  E. coli          2JLC
pyruvate decarboxylase                                                      Z. mobilis       1ZPD
branched-chain keto acid decarboxylase                                      L. lactis        2VBF
benzoylformate decarboxylase                                                P. putida        1BFD
carboxyethylarginine synthase                                               S. clavuligerus  2IHT
cyclohexene-1,2-dione hydrolase                                             Azoarcus sp.     2PGN
oxalyl-CoA decarboxylase                                                    O. formigenes    2C31
pyruvate oxidase                                                            A. viridans      1V5F
pyruvate dehydrogenase                                                      E. coli          3EYA
indolepyruvate decarboxylase                                                E. cloacae       1OVM
acetohydroxyacid synthase                                                   S. cerevisiae    1JSC
acetohydroxyacid synthase                                                   A. thaliana      1YBH
acetohydroxyacid synthase                                                   K. pneumoniae    1OZF
benzaldehyde lyase                                                          P. fluorescens   2AG0
glyoxylate carboligase                                                      E. coli          2PAN

Of each PDB entry, chain A was used for alignment. It was verified that for all proteins chain A corresponds to the catalytic subunit. 
In contrast to the PYR and the PP domain, the secondary
structure elements of the TH3 domains of different
decarboxylases vary considerably near their N- and
C-termini, thus leading to numerous gaps in the alignment
at these positions. Therefore, the start of the TH3
domain was shifted four positions downstream and the
end was shifted five positions upstream into regions
which were free of gaps, though sequence conservation
was still low.

The absolute amino acid numbers and annotation in- 
formation were transferred from the reference sequence 
to the respective positions of all members of the decarb- 
oxylase superfamily by aligning them to the profile 
HMM. 
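
The transfer of position numbers can be illustrated with a minimal Python sketch. It assumes that the reference and the query have already been aligned to the same profile (the HMM alignment itself is not shown) and are available as two gapped strings of equal length. The function name transfer_numbering, the toy sequences and the choice to give query residues in insertion columns no standard number (None) are assumptions made for illustration and are not taken from the TEED implementation.

```python
def transfer_numbering(ref_aligned, query_aligned, gap="-"):
    """Assign reference (standard) numbers to the residues of an aligned query.

    Both strings must come from the same alignment, so that equal string
    indices denote the same alignment column.
    Returns a list of (query_position, residue, standard_number) tuples;
    standard_number is None where the reference has a gap (assumed convention).
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have the same length")

    numbering = []
    ref_pos = 0    # absolute position in the reference sequence
    query_pos = 0  # absolute position in the query sequence
    for ref_char, query_char in zip(ref_aligned, query_aligned):
        if ref_char != gap:
            ref_pos += 1
        if query_char != gap:
            query_pos += 1
            standard = ref_pos if ref_char != gap else None
            numbering.append((query_pos, query_char, standard))
    return numbering


# Toy example (hypothetical sequences, not taken from the TEED)
for entry in transfer_numbering("MSE-ITLG", "M-EAIT-G"):
    print(entry)
```

In this toy case the third query residue (A) sits in a column where the reference has a gap and therefore receives no standard number, while the final glycine is query position 6 but standard position 7.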

A web application was integrated into the web inter- 
face of the TEED (www.TEED.uni-stuttgart.de) to pro- 
vide public access to the numbering tool. Upon 
submission of a single query sequence or a list 
of sequences in FASTA format, the standard number- 
ing is applied and the sequence including the number- 
ing and annotations for each amino acid can be 
downloaded (Figure 1; a description of the file format is 
given in the Additional file 1, a sample is given in the 
Additional file 2). 



The accuracy of the HMM-based alignment was com- 
pared to a multisequence alignment using T-Coffee [44] 
by aligning the reference sequence ScPDC and 15 
sequences from the decarboxylase family for which 
structural information was available but which were not 
part of the set of representative proteins. To determine 
the differences between the HMM-based alignment and 
the T-Coffee alignment, all columns were compared be- 
tween the two alignments and a similarity score was 
assigned to each column (Additional file 3). Alignment 
columns were "identical" if both alignment algorithms 
placed the same residues for all sequences into the re- 
spective columns; "highly similar" if the two alignments 
differed in 1-3 sequences; "similar" if 4-8 mismatches 
were observed; "dissimilar" if 9-12 sequences differed
at the respective position; "divergent" if the
alignments differed in 13-15 of the 15 sequences.
As a result, 73% of all columns were identical or 
highly similar in both alignments (Figure 2). For those 
columns which deviated considerably between the two 
alignments (dissimilar or divergent columns), a struc- 
tural comparison revealed that in almost all cases the 
HMM-based alignment represented the structural 
equivalence better than the multisequence alignment 



[Figure 1: image of the query/reference sequence alignment not reproduced; annotation shown in the image: PYR-end, standard: 168, query: 165]

Figure 1 Alignment of a query sequence and the reference sequence from the web interface of the numbering method. Alignment of a 
query sequence (here: branched-chain alpha-ketoacid decarboxylase from L. lactis, genbank: 75369656) to the reference sequence (ScPDC, 
swissprot: P06169, PDB: 2VK8). By positioning the cursor on an amino acid (here: proline), the standard numbering (here: 168) as derived from the 
reference sequence and the absolute numbering of the respective query sequence (here: 165) as well as annotation information (here: end of the 
PYR domain) are displayed. All residues are highlighted for which annotation information is available in the TEED [1]. 



Vogel et al. BMC Biochemistry 2012, 13:24 
http://www.biomedcentral.com/1471-2091/13/24







[Figure 2: bar chart; categories: identical, highly similar, similar, dissimilar, divergent; axis: column similarity [%]]



Figure 2 Analysis of accordance of two multisequence alignments. The comparison of columns of two multisequence alignments of 15 
sequences using the numbering method and T-Coffee revealed five types of column similarity. 48% of the investigated columns were identical in 
both alignments, 25% of the columns were "highly similar" (up to 3 mismatches out of 15 sequences), 5% were "similar" (4-8 mismatches), 12% 
of the columns had 9 to 12 mismatches and are therefore called "dissimilar" and 10% of the columns showed more than 12 mismatches 
("divergent"). 




by T-Coffee (Additional file 1: Figures S1 and S2).
In addition, it was verified that all 22 functionally
relevant positions were aligned correctly (Additional
file 1: Table S1).
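
The column comparison described above can be sketched in Python as follows. The thresholds are those given in the text (0, 1-3, 4-8, 9-12 and 13-15 mismatches). Anchoring the columns to the residues of the reference sequence, the function names and the dictionary input format are assumptions for illustration; the original analysis may have been organised differently.

```python
def classify_column(n_mismatches):
    """Translate a mismatch count (out of 15 sequences) into a similarity class."""
    if n_mismatches == 0:
        return "identical"
    if n_mismatches <= 3:
        return "highly similar"
    if n_mismatches <= 8:
        return "similar"
    if n_mismatches <= 12:
        return "dissimilar"
    return "divergent"


def residue_indices(aligned, gap="-"):
    """For each alignment column: 1-based residue index, or None for a gap."""
    indices, pos = [], 0
    for char in aligned:
        if char == gap:
            indices.append(None)
        else:
            pos += 1
            indices.append(pos)
    return indices


def columns_by_reference(ref_aligned, other_aligned, gap="-"):
    """For each reference residue: which residue of the other sequence shares its column."""
    other_idx = residue_indices(other_aligned, gap)
    return [other_idx[i] for i, char in enumerate(ref_aligned) if char != gap]


def compare_alignments(aln_a, aln_b, ref_name):
    """Classify every reference position by the agreement of two alignments.

    aln_a and aln_b map sequence names to gapped strings and must contain
    the same sequences, including the reference.
    """
    names = [name for name in aln_a if name != ref_name]
    cols_a = {n: columns_by_reference(aln_a[ref_name], aln_a[n]) for n in names}
    cols_b = {n: columns_by_reference(aln_b[ref_name], aln_b[n]) for n in names}
    n_positions = sum(1 for char in aln_a[ref_name] if char != "-")
    categories = []
    for i in range(n_positions):
        mismatches = sum(cols_a[n][i] != cols_b[n][i] for n in names)
        categories.append(classify_column(mismatches))
    return categories
```

With the percentages reported in Figure 2, the "identical" (48%) and "highly similar" (25%) classes add up to the 73% of columns quoted in the text.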

Identification of conserved residues and domain boundaries 

After having applied a standard numbering scheme for 
all 3000 members of the decarboxylase superfamily, the 
respective protein sequences were systematically ana- 
lysed for the occurrence of amino acids at corresponding 
positions. Four groups of positions with different charac- 
teristics of conservation were found. 

The first group includes 6 positions which were con- 
served in more than 90% of all members of the decarb- 
oxylase superfamily, while no other amino acid occurred 
in more than 1% of the sequences: Position 27 (standard 
numbering) in the S-pocket which was glycine in 91% of
all members of the decarboxylase superfamily, position 
443 in the GDGX motif which was glycine in 98% of all 
decarboxylases, and four highly conserved positions 
which have not yet been identified as being of functional 
or structural relevance: positions 58 (alanine in 96% of 
the sequences), 94 (proline in 91% of the sequences), 
219 (glycine in 91% of the sequences), and 286 (glycine 
in 97% of the sequences). Thus, 6 positions (mostly gly- 
cine residues) are highly conserved in almost all mem- 
bers of the decarboxylase superfamily. 

The second group includes 3 positions in which one 
amino acid was found in a majority of more than 90% of 
all members of the decarboxylase superfamily and a different
amino acid in a minority (> 3%) of all sequences.
The most conserved position was the active site residue
Glu 51. This conserved glutamic acid was found in 94% 
of all sequences, while 3% have a valine in this position. 
D444 of the GDGX motif was conserved in 91% of all 
cases, while 7% have a glutamic acid in this position. At 
position 280, aspartic and glutamic acid were found in 
90% and 4%, respectively, of all members of the decarb- 
oxylase superfamily. Thus, this group includes positions 
which seem to be characteristic for a distinct subgroup 
of this superfamily. 
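
The assignment of positions to the first two groups can be sketched in Python. The sketch assumes that the residues observed at one standard position have already been collected from all numbered sequences. The thresholds follow the text (majority above 90%, no other residue above 1% for group 1, a second residue at about 3% or more for group 2); the function names, the use of ">=" for the 3% limit (the reported percentages are rounded) and the synthetic columns are assumptions.

```python
from collections import Counter


def residue_frequencies(residues):
    """Relative frequency of each amino acid observed at one standard position."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def conservation_group(residues, major=0.90, trace=0.01, minor=0.03):
    """Return 1 or 2 for the conservation groups described above, else None."""
    if not residues:
        raise ValueError("no residues observed at this position")
    ranked = sorted(residue_frequencies(residues).values(), reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    if top > major and second <= trace:
        return 1  # one dominant residue, nothing else above the trace level
    if top > major and second >= minor:
        return 2  # dominant residue plus a distinct minority residue
    return None   # variable position or intermediate case


# Synthetic columns (hypothetical counts chosen to mimic the reported patterns)
glycine_like = ["G"] * 196 + list("ASTN")           # 98% G, rest scattered
glutamate_like = ["E"] * 94 + ["V"] * 4 + ["D", "Q"]  # 94% E, 4% V
print(conservation_group(glycine_like), conservation_group(glutamate_like))
```

Positions that fall into neither group, such as the variable S-pocket positions, return None in this sketch.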

The third group encompasses variable positions which 
are known to be involved in substrate recognition or ca- 
talysis. In positions 114 and 115, the majority of all 
members of the decarboxylase superfamily have a 
phenylalanine (58%) and a glutamine (81%), respectively, 
while a minority, predominantly PDCs, show histidine 
(15% and 12%, respectively) in these positions. These 
histidines have been referred to as the HH-motif in the 
PDC family [28]. A functionally relevant, though highly
variable, site is the S-pocket, which contributes to the
stereoselectivity of decarboxylases [29-31]. Two positions,
476 and 477, which were shown to contribute to
the S-pocket or the entrance of the S-pocket, were
highly variable in all members of the decarboxylase 
superfamily. At standard position 476, most members of
the decarboxylase superfamily show a methionine (42%)
or an isoleucine residue (18%), while standard
position 477 is occupied by valine (45%) or isoleucine
(20%).

The fourth group included the domain boundaries of
the three protein domains PYR, PP and TH3.
Identification of the domain boundaries can be
easily accomplished when structural information is available,
whereas an identification of domain boundaries
based on the amino acid sequence alone is not straight- 
forward due to the low sequence similarity in the loop 
regions connecting the three domains. However, align- 
ments using the profile HMM revealed several con- 
served positions: the start of the PYR domain (standard 
numbering 6) is indicated by a conserved glycine (in 
44% of all sequences), while its end (position 168) is 
highly conserved (proline in 87% of all cases). Similarly, 
the PP domain starts at position 367 (proline in 54% of 
all sequences) and ends at position 540 (valine in 37% of 
all sequences). These four positions coincided well with 
the start and end of the ThDP-binding fold. In contrast, 
the start and end positions of the TH3 domain were 
highly variable. Therefore, two positions further inside 
the TH3 domain were selected to characterise the start 
and the end of this domain: positions 197 (aspartic acid 
in 18% of all cases) and 336 (lysine in 17% of all 
sequences). Despite the low sequence similarity in the 
boundary region, the assignment of standard numbers 
was consistent with the results from a structural 
superimposition. 

Furthermore, the regions around the 9 highly con- 
served positions of group 1 and 2 were investigated con- 
cerning sequence conservation in order to investigate 
the presence of sequence motifs. With the exception of 
position 27 (standard numbering), their surrounding 
regions were sufficiently conserved to allow for the der- 
ivation of sequence motifs. The region around residue 
G443 is already known as the GDGX(24-27)N-motif [45].
In order to analyse the specificity and the precision of 
the remaining motifs for the decarboxylase superfamily, 
they were used in a motif search against the non- 
redundant NCBI database, while an updated version of 
the TEED (not yet published) served as positive control 
[1]. The motif [DHN]50-E51-[AEGLQ]52-[AGNSTV]53-[AGLMV]54-[AGISTV]55-[FHLMY]56-[AFILM]57-A58,
which was derived from the region around the con- 
served positions 51 and 58, showed similar sensitivity 
(0.65) and precision (0.27) as the PROSITE pattern 
PS00187, which is an extended version of the 
GDGX(24-27)N-motif and was described as a conserved
motif of POXs (EC 1.2.3.3), PDCs (EC 4.1.1.1), AHASs 
(EC 2.2.1.6), BFDs and indolepyruvate decarboxylases 
(IPDCs, EC 4.1.1.74) [46-48] (sensitivity: 0.59, precision: 
0.42). This motif is part of an α-helix, which is involved
in the formation of the active site. In addition, the motif 
surrounding position 280 had at least similar precision 
and sensitivity for ThDP-dependent decarboxylases as 
the simple GDGX(24-27)N-motif [45] (data not shown).
Thus, a second motif [DE]280-[ACFLTV]281-[ILMV]282-[FILV]283-[ACGLMNSTV]284-[AFILV]285-G286
was identified with 5 predominantly hydrophobic amino acids
between two highly conserved positions D/E280 and
G286, which form the vertices of the loops connecting a
central β-strand of the TH3 domain to the adjacent α-helices.
The remaining motifs were less specific and
sensitive for the identification of ThDP-dependent
decarboxylases.
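
A motif search of this kind can be sketched with Python regular expressions. The pattern below is the motif around positions 50-58 written without the position subscripts. Sensitivity is taken here as the fraction of family members that are hit and precision as the fraction of hits that are family members; these standard definitions, the function name evaluate_motif and the toy sequences are assumptions, since the text does not spell out how the values were computed.

```python
import re

# Motif around standard positions 50-58, position subscripts omitted
MOTIF_50_58 = "[DHN]E[AEGLQ][AGNSTV][AGLMV][AGISTV][FHLMY][AFILM]A"


def evaluate_motif(pattern, sequences, family_ids):
    """Return (sensitivity, precision) of a motif search.

    sequences maps sequence identifiers to ungapped sequences;
    family_ids is the set of identifiers known to belong to the family.
    """
    regex = re.compile(pattern)
    hits = {seq_id for seq_id, seq in sequences.items() if regex.search(seq)}
    true_positives = len(hits & family_ids)
    sensitivity = true_positives / len(family_ids) if family_ids else 0.0
    precision = true_positives / len(hits) if hits else 0.0
    return sensitivity, precision


# Toy database (hypothetical sequences); fam3 carries V instead of E at position 51
toy_db = {
    "fam1": "MKDEAGAGFAAKL",
    "fam2": "MSVNEQSVTHIAGG",
    "fam3": "MKDVAGAGFAAKL",
    "other1": "GGHELNMSLMAPP",
    "other2": "MKKLLPPWWRR",
}
sens, prec = evaluate_motif(MOTIF_50_58, toy_db, {"fam1", "fam2", "fam3"})
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In the toy database the family member with valine in place of the conserved glutamate is missed, which lowers the sensitivity, and one unrelated sequence matches by chance, which lowers the precision.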

Application of the numbering scheme to experimentally 
characterized positions 

An extensive literature search yielded 22 positions which 
were experimentally well characterized in five different 
proteins (ScPDC, ApPDC, ZmPDC, PfBAL and PpBFD)
and shown to be of relevance to substrate specificity 
and/or activity. The numbering scheme was exemplarily 
applied to the respective sequences in order to compare 
the annotation information from the literature. Several 
equivalent positions in different proteins were shown to 
have different absolute numbers (Additional file 1: Table 
S2). An influence on the decarboxylase activity was 
shown for the residues D28 of ScPDC, D27 of ZmPDC 
and A28 of PfBAL, each corresponding to standard position
28. Furthermore, structural and functional equivalence
was shown for A28 in PfBAL and S26 in PpBFD.
Similarly, positions 114 and 115, which were described 
as the HH-motif of pyruvate decarboxylases (Additional 
file 1: Table S2) [28] are structurally and functionally 
identical in different PDCs, but differ in their absolute 
position numbers. The mutations W388A,I in ApPDC
were shown to reduce stereoselectivity while the muta- 
tions W392A,I,M of ZmPDC led to an improved carboli- 
gation activity. However, both positions are structurally 
equivalent and are addressed with standard number 392. 
Functional relevance is also described for position 477 
(standard number) in ScPDC, ApPDC and ZmPDC. All 
mutations of the respective residues (E477Q in ScPDC, 
E469G in ApPDC and E473D,Q in ZmPDC) revealed an impact on
the decarboxylation reaction [23-25,29]. The examination
of these five examples and the differences between
the absolute and the standard numbers of
functionally equivalent positions showed that the presented
numbering scheme for the ThDP-dependent
decarboxylases eases the communication on variants 
and the comparison of functionally relevant positions. 
The assignment of standard numbers to positions of 
different homologous proteins furthermore simplifies 
the prediction of the impact of mutations at equivalent 
positions. 
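
How variant names translate between absolute and standard numbering can be sketched in Python. The lookup table contains only the position pairs stated in the text above; in practice the mapping would come from aligning each sequence to the profile HMM. The names KNOWN_POSITIONS and to_standard are hypothetical.

```python
import re

# absolute position -> standard position, restricted to pairs named in the text
KNOWN_POSITIONS = {
    "ScPDC": {28: 28, 477: 477},
    "ZmPDC": {27: 28, 392: 392, 473: 477},
    "ApPDC": {388: 392, 469: 477},
    "PfBAL": {28: 28},
}

# Single substitution such as E469G: original residue, position, new residue
VARIANT = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def to_standard(protein, variant):
    """Rewrite a variant name from absolute to standard numbering."""
    match = VARIANT.match(variant)
    if match is None:
        raise ValueError(f"not a single-substitution variant name: {variant!r}")
    original, position, replacement = match.group(1), int(match.group(2)), match.group(3)
    try:
        standard = KNOWN_POSITIONS[protein][position]
    except KeyError:
        raise KeyError(f"no standard number recorded for {protein} position {position}") from None
    return f"{original}{standard}{replacement}"


print(to_standard("ApPDC", "E469G"))  # E477G
print(to_standard("ZmPDC", "E473Q"))  # E477Q
```

Written in standard numbering, E469G of ApPDC and E473Q of ZmPDC both refer to position 477, the same position as E477Q in ScPDC.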

Discussion 

A standard numbering scheme has been established for
the structural superfamily of ThDP-dependent decarboxylases,
as has been done previously for two protein
families, the β-lactamase family and the complementarity-determining
regions of antibodies [34,35]. A standard
numbering scheme for a protein family enables unambiguous
communication between research groups
about corresponding positions in different proteins and 
supports the automated systematic analysis of sequences 
and the classification of proteins into sub-groups [49]. In
principle, a numbering scheme could be established by 
performing pairwise alignments of each sequence of the 
protein family to a reference sequence. However, al- 
though structurally conserved, the superfamily of ThDP- 
dependent decarboxylases shows only low sequence 
similarity. As a consequence, pairwise alignments are in 
general not reliable. As an alternative, multisequence 
alignment methods were successfully applied to align 
homologous proteins with low sequence similarity [50]. 
By performing a multisequence alignment of all 
sequences of the decarboxylase superfamily, the number- 
ing of a reference sequence could be transferred to each 
aligned decarboxylase sequence. However, a new align- 
ment has to be calculated for each new sequence to be 
included. Calculating multisequence alignments of many
thousands of sequences with low sequence similarity is
not only computationally intensive but, more importantly,
lacks robustness, because the alignment might
change upon inclusion of additional sequences. In contrast,
profile hidden Markov models (HMMs) based on a
structure-driven alignment are a robust description of 
protein families and allow the user to align new 
sequences to an existing multisequence alignment [51]. 
By alignment of a sequence to a profile built from a set 
of representative proteins, the numbering can be trans- 
ferred from the reference sequence to a query sequence. 
However, the quality of the numbering depends on 
the quality of the profile. Therefore, the proteins in the 
profile HMM were carefully selected. From each of the 
sixteen families with structural information, a represen- 
tative protein was selected for a structure-guided align- 
ment [52] to guarantee the structural equivalence in 
the reference profile. Because some members of the de- 
carboxylase superfamily show activation upon binding 
of a substrate at a second (allosteric) binding site (e.g. 
ScPDC) [6] which leads to conformational changes, the 
set of reference proteins only contained decarboxylases 
which show no substrate activation or which have been 
crystallized in complex with an allosteric activator. 
Thus, only structures of active enzymes were com- 
pared. The alignment was further manually refined in 
order to improve consistency and robustness. Since the 
presented numbering scheme aims to compare
structurally equivalent positions, the method depends 
on structural similarity of the proteins in the corre- 
sponding family. Accordingly, the method can be 
adapted to other protein families matching this require- 
ment globally or at least in structurally conserved 
domains. 



By establishing a standard numbering scheme for the 
ThDP-dependent decarboxylase superfamily, the unam- 
biguous identification, numbering, and analysis of func- 
tionally and structurally relevant residues was possible. 
The analysis of conserved positions in the protein family 
of ThDP-dependent decarboxylases revealed that the 
previously observed substitution of the active site glu- 
tamate by valine in members of the glyoxylate carboli- 
gase family at standard position 51 [43,53] is indeed 
characteristic of the entire family, which indicates a dif- 
ferent mechanism in glyoxylate carboligases [53]. It 
could also be shown that the active site "HH-motif" 
which has been described for various members of the 
decarboxylase superfamily [28] is highly specific for only 
a small number of decarboxylases, the pyruvate decar- 
boxylases, indolepyruvate decarboxylases, and phenyl- 
pyruvate decarboxylases, and is not present in the 
majority of the enzymes. The four highly conserved glycine 
residues at standard positions 27, 219, 286 and 443 
are all located between the C-cap of a β-strand and the 
N-cap of an α-helix of β-α-β supersecondary structure 
elements, which has been shown to be a typical pattern 
for α-β units [54]. These elements presumably are relevant 
for the correct folding of the ThDP-dependent 
decarboxylases. 

The assignment of standard numbers to experimen- 
tally well characterized positions allows for an easy com- 
parison of positions between different proteins and 
different organisms regarding their structural equivalence. 
This was demonstrated by an in-depth analysis of 
five different members of the decarboxylase superfamily 
(Additional file 1: Table S2). Several positions were iden- 
tified which share the same standard numbers, show 
similar functional influence and are structurally equiva- 
lent, but deviate in their absolute position numbers by 
up to 8 positions. Prediction of the functional influence 
of mutations in homologous sequences based on the ab- 
solute position numbers of given sequences is not 
straightforward, but becomes feasible using a standard 
numbering scheme. Thus, new sequence motifs were 
found by systematically analysing the amino acid distri- 
bution at each position of all members of the ThDP- 
dependent decarboxylase family. A new family-specific 
sequence motif was derived from the conserved region 
near the catalytic glutamic acid at position 51 (standard 
position) and the conserved alanine at position 58 
(standard position). The respective motif was shown to 
be as sensitive and precise for the ThDP-dependent dec- 
arboxylases as the PROSITE pattern PS00187, but due 
to the defined E51, it cannot be used to identify glyoxy- 
late carboligases, which have a valine at the respective 
position [53]. In addition, despite the higher variability 
of the TH3 domain in comparison to the PYR and the 
PP domain, the sequence of a β-strand found in the TH3 
domain (standard positions 280-286) consists of a conserved 
motif. In contrast to the previously mentioned 
motif, this region is not part of the active site but is pre- 
sumably relevant for the structure or regulation of the 
protein. The adjacent loop region 286-304 was 
described as a part of the activation cascade of pyruvate 
decarboxylases, since this loop shows structural re- 
arrangement upon binding of an activator at the effector 
binding site at standard position 221 [6]. 
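
The position-wise analysis of amino acid distributions described above can be illustrated with a minimal Python sketch. It assumes that every sequence has already been converted into a mapping from standard position to residue; the function name `position_distribution` and the toy data are hypothetical and do not reproduce the TEED implementation or its actual frequencies.

```python
from collections import Counter
from typing import Dict, List


def position_distribution(numbered: List[Dict[int, str]], position: int) -> Dict[str, float]:
    """Relative amino acid frequencies at one standard position.

    `numbered` holds one dict per sequence, mapping standard position
    numbers to one-letter residue codes. Sequences that do not cover
    the requested position are ignored.
    """
    counts = Counter(seq[position] for seq in numbered if position in seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}


# Toy data (invented): three sequences with glutamate and one with
# valine at standard position 51, plus one sequence lacking position 51.
toy_sequences = [{51: "E"}, {51: "E"}, {51: "E"}, {51: "V"}, {50: "G"}]
distribution = position_distribution(toy_sequences, 51)
```

Applied to each standard position in turn, such a table would highlight strictly conserved positions as well as family-specific substitutions, for example a valine in place of the glutamate at standard position 51.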

Conclusions 

By introducing a robust and reliable numbering scheme 
for the family of ThDP-dependent decarboxylases, we 
provided a frame of reference for this diverse protein 
family. Besides being a reliable tool to identify and num- 
ber residues and domain boundaries for the superfamily 
of ThDP-dependent decarboxylases, the presented im- 
plementation of a numbering scheme is generic and can 
be adapted to other protein families as well. The useful- 
ness and reliability of the presented numbering method 
were demonstrated for various examples. 

Methods 

Reference alignment and position number assignment 

Sixteen representative members of the decarboxylase superfamily 
were selected from the ThDP-dependent Enzyme 
Engineering Database [1] according to three criteria: 1) Only 
proteins with known crystal structure were chosen for the 
reference alignment. From each of the 16 homologous 
families that contain structure information, one member 
was selected. 2) Some decarboxylases show activation 
upon binding of a substrate molecule to an allosteric 
binding site, which leads to conformational changes. In 
these cases, only structures that were 
crystallized in complex with a bound substrate or a substrate 
analogue were chosen. 3) For homologous families with more 
than one structure entry matching these criteria, the 
structure with the highest resolution was selected. 
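
The three selection criteria can be summarised in a short Python sketch. The record type `Structure`, its field names, and the function `select_representatives` are hypothetical and serve only as an illustration; the actual selection was made from TEED entries and is not reproduced here.

```python
from typing import Dict, Iterable, NamedTuple


class Structure(NamedTuple):
    identifier: str               # structure entry (toy identifiers below)
    family: str                   # homologous family of the protein
    resolution: float             # in Angstrom; lower values are better
    substrate_activated: bool     # enzyme shows allosteric substrate activation
    ligand_bound: bool            # crystallized with substrate or analogue


def select_representatives(structures: Iterable[Structure]) -> Dict[str, Structure]:
    """Pick one structure per homologous family.

    Criterion 1 is implicit: only entries with a structure are passed in.
    Criterion 2: skip substrate-activated enzymes without a bound ligand.
    Criterion 3: keep the best-resolved remaining entry of each family.
    """
    best: Dict[str, Structure] = {}
    for s in structures:
        if s.substrate_activated and not s.ligand_bound:
            continue
        current = best.get(s.family)
        if current is None or s.resolution < current.resolution:
            best[s.family] = s
    return best


# Invented example entries, not real database records.
candidates = [
    Structure("toy1", "famA", 2.0, False, False),
    Structure("toy2", "famA", 1.5, False, False),
    Structure("toy3", "famB", 1.2, True, False),
    Structure("toy4", "famB", 2.4, True, True),
]
representatives = select_representatives(candidates)
```

In this toy set, `toy3` is rejected despite its better resolution because it lacks a bound substrate or analogue, so `toy4` represents its family.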

From these 16 representative proteins, a structure-guided 
multiple sequence alignment was created with STAMP [52]. This 
reference alignment was manually refined to align secondary 
structure elements and thus to reduce the number of 
gaps scattered in the alignment (Additional file 4). A 
family-specific profile hidden Markov model was derived from the 
reference alignment with HMMER [55]. 

The sequence of the pyruvate decarboxylase from S. 
cerevisiae (PDB: 2VK8 [6], Swissprot: P06169, EC: 
4.1.1.1) was chosen as the reference sequence, because it 
is a widely applied and well characterized ThDP- 
dependent enzyme [6-8,56]. Standard position numbers 
were assigned by aligning the sequence of each member 
of the decarboxylase superfamily against the profile 
HMM and by subsequently transferring the absolute 
position numbers of the reference sequence to the corre- 
sponding positions of the respective decarboxylase 
sequence. 
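
The transfer of position numbers can be sketched as follows, assuming that the reference and the query are available as two rows of the same alignment (equal length, `-` for gaps). The function name `standard_numbering` and the toy sequences are hypothetical, and assigning `None` to query residues inserted relative to the reference is an assumption of this sketch rather than a documented convention of the method.

```python
from typing import Dict, Optional


def standard_numbering(aligned_ref: str, aligned_query: str) -> Dict[int, Optional[int]]:
    """Map absolute query positions to standard (reference) positions.

    Both arguments are rows of one alignment. Positions are 1-based.
    Query residues aligned to a gap in the reference receive None.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")

    mapping: Dict[int, Optional[int]] = {}
    ref_pos = 0     # absolute position in the reference sequence
    query_pos = 0   # absolute position in the query sequence
    for ref_res, query_res in zip(aligned_ref, aligned_query):
        if ref_res != "-":
            ref_pos += 1
        if query_res != "-":
            query_pos += 1
            mapping[query_pos] = ref_pos if ref_res != "-" else None
    return mapping


# Invented toy alignment: the query lacks reference residue 2 and
# carries one inserted residue after reference position 3.
numbering = standard_numbering("MSE-ITLG", "M-EAITLG")
# numbering == {1: 1, 2: 3, 3: None, 4: 4, 5: 5, 6: 6, 7: 7}
```

The example shows how absolute and standard numbers drift apart after insertions and deletions, which is why structurally equivalent residues can differ in their absolute position numbers.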

Web tool 

An open access web application is provided to allow 
users to assign standard position numbers to decarboxylase 
sequences (http://www.teed.uni-stuttgart.de). After 
submitting a query sequence, a BLAST search against a 
database of members of the structural group of decar- 
boxylases from the TEED [1] is performed. Only query 
sequences with an E-value less than 10^-10 are accepted 
to guarantee a reliable sequence alignment. Then the 
query sequence is aligned to the reference alignment 
using the profile HMM, and the absolute position num- 
bers of the reference sequence are transferred to the 
query sequence. Finally, annotation information of the 
TEED such as catalytic residues, domain boundaries, or 
activator binding sites is transferred to the respective 
positions of the query sequence. 
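
The last two steps of this workflow, the E-value filter and the transfer of annotations, can be illustrated by a minimal Python sketch. All names are hypothetical and the web application itself is not reproduced; the cutoff of 1e-10 is the threshold as read from the text above, and the BLAST search and profile alignment are assumed to have been carried out beforehand.

```python
from typing import Dict, Optional

E_VALUE_CUTOFF = 1e-10  # assumed reading of the threshold stated in the text


def is_reliable_hit(e_value: float, cutoff: float = E_VALUE_CUTOFF) -> bool:
    """Accept a query only if its BLAST E-value is below the cutoff."""
    return e_value < cutoff


def transfer_annotations(
    abs_to_std: Dict[int, Optional[int]],
    annotations: Dict[int, str],
) -> Dict[int, str]:
    """Carry annotations from standard positions over to query positions.

    `abs_to_std` maps absolute query positions to standard positions
    (None for residues without a standard number); `annotations` maps
    standard positions to a description.
    """
    return {
        abs_pos: annotations[std_pos]
        for abs_pos, std_pos in abs_to_std.items()
        if std_pos in annotations
    }


# Invented query numbering; the two annotated standard positions are
# those named in the text (catalytic glutamate 51, effector site 221).
query_numbering = {47: 51, 48: 52, 49: None, 215: 221}
known_sites = {51: "catalytic glutamate", 221: "effector binding site"}

if is_reliable_hit(3e-42):
    annotated = transfer_annotations(query_numbering, known_sites)
```

Keying annotations by standard position means that they have to be curated only once for the reference and can then be reused for any accepted query sequence.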

Additional files 



Additional file 1: Figures S1 and S2, Tables S1 and S2, description 
of the nvw file format. 

Additional file 2: nvw file of the pyruvate decarboxylase from 
S. cerevisiae. 

Additional file 3: Comparison of alignments generated using the 
numbering method and T-Coffee. 

Additional file 4: Reference alignment for the standard numbering 
method. 